Introduction

Before we even choose a model algorithm for our species distribution model (SDM), we need to have data to fit it to. For an SDM we need two types of data: species observation records (presence-only, presence-absence, or abundance) and environmental data (such as bioclimatic variables or vegetation maps, in the form of rasters). In some instances we might be using data we recorded ourselves, other times we might want to use remotely sensed data, or even a combination of the two. There are multiple sources of data available, and zoon gives you the ability to access many of them.

With zoon you can use your own data, data sourced from online repositories, or the example occurrence and environmental datasets that zoon provides as modules. This tutorial will guide you through the data sources available within a zoon workflow().


zoon Modules

zoon comes with several pre-existing dataset modules that we can use. Using the GetModuleList() command we can see all of the Occurrence modules under the $occurrence sub-heading, and this includes the dataset modules. We can also view the Covariate modules under the $covariate sub-heading.

modules <- GetModuleList()
modules$occurrence            # subset for the sake of screen space
##  [1] "CarolinaWrenPO"             "CWBZimbabwe"               
##  [3] "LocalOccurrenceData"        "Lorem_ipsum_UK"            
##  [5] "NaiveRandomPresence"        "NaiveRandomPresenceAbsence"
##  [7] "SugarMaple"                 "UKAnophelesPlumbeus"       
##  [9] "CarolinaWrenPA"             "NATrees"                   
## [11] "AnophelesPlumbeus"          "SpOcc"
modules$covariate
## [1] "Bioclim_future"      "CarolinaWrenRasters" "LocalRaster"        
## [4] "NaiveRandomRaster"   "UKBioclim"           "NCEP"               
## [7] "UKAir"               "AirNCEP"             "Bioclim"

For example, we could choose to fit a model to the Carolina Wren data using the CarolinaWrenPO or CarolinaWrenPA occurrence modules (presence-only and presence-absence, respectively) with the CarolinaWrenRasters covariate module.

While these are perfectly usable datasets, they are just useful examples and not something we would fit an SDM to for the sake of the results. These are most useful for experimenting with zoon modules (want to explore the differences in model algorithms? Run them on these example dataset modules and compare the outputs), or as test datasets when building new modules of your own.
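As a minimal sketch, a complete workflow on the Carolina Wren example data might look like the code below. The occurrence and covariate modules and Background(1000) come from this tutorial; LogisticRegression and PrintMap are assumed here to be available as model and output modules, and the call needs the zoon package installed and an internet connection to fetch the modules. The requireNamespace() guard is only there so the snippet does nothing when zoon is missing.

```r
# Sketch only: assumes zoon is installed and the modules can be downloaded.
wren_workflow <- NULL
if (requireNamespace("zoon", quietly = TRUE)) {
  library(zoon)
  wren_workflow <- workflow(occurrence = CarolinaWrenPO,       # presence-only example data
                            covariate  = CarolinaWrenRasters,  # matching example rasters
                            process    = Background(1000),     # add background points
                            model      = LogisticRegression,   # assumed model module
                            output     = PrintMap)             # assumed output module
}
```

Swapping the model module while keeping everything else fixed is one way to compare algorithms on the same example data.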


Our Own Data

SDMs are commonly fit to datasets we have collected ourselves, and zoon has modules to help us load them. The two modules of interest here are LocalOccurrenceData for our observation records and LocalRaster for our raster-based data.

To ensure that all datasets loaded into a zoon workflow are compatible with the model modules, the LocalOccurrenceData module requires our data to be a .csv, .xlsx, or .tab file with a strict structure. The first and second columns hold the longitude and latitude values (in that order), and the third column holds the value of the observation (0 for absence, 1 for presence, or an integer count for abundance data). If your coordinate system is not latitude/longitude, you can supply an optional fourth column called CRS that contains the proj4string for your coordinate system (e.g. “+init=epsg:27700” for easting/northing data). If no CRS column is supplied, latitude/longitude is assumed.

To use this module you call the occurrence module like this:

occurrence = LocalOccurrenceData(filename = "myData.csv",         # File path to your data file
                                 occurrenceType = "presence",     # The type of data you have
                                 columns = c(long = "longitude",  # The names of the columns in 
                                             lat = "latitude",    #     your .csv that match the 
                                             value = "value"),    #     required columns
                                 externalValidation =  FALSE)     # Only required if validation
                                                                  #     data is set up externally
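A file with the column structure that LocalOccurrenceData expects can be prepared with base R alone. The sketch below uses invented coordinates and a temporary file path purely for illustration; only the column order (longitude, latitude, value) is taken from the description above.

```r
# Hypothetical presence records (coordinates invented for illustration).
# Column order matters: longitude, latitude, then the observation value.
records <- data.frame(
  longitude = c(-0.12, -0.45, 0.10),
  latitude  = c(51.50, 51.70, 51.30),
  value     = c(1, 1, 1)              # 1 = presence, 0 = absence
)

# row.names = FALSE keeps R's row numbers out of the file, so longitude
# stays in the first column.
csv_path <- file.path(tempdir(), "myData.csv")
write.csv(records, csv_path, row.names = FALSE)

# Read the file back and confirm the structure before handing it to zoon.
check <- read.csv(csv_path)
stopifnot(
  identical(names(check)[1:3], c("longitude", "latitude", "value")),
  all(check$value %in% c(0, 1))
)
```

The path stored in csv_path can then be passed as the filename argument of LocalOccurrenceData.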

Raster data loaded into a workflow using LocalRaster also follows a set format, but it is a simpler process than for occurrence data. This module reads in either a single raster or raster-stack, or a list or vector of rasters and creates a raster-stack.

To use this module you call the covariate module like this:

covariate = LocalRaster(rasters = c("myRaster1",     # Filepath to a raster
                                    "myRaster2"))    # Filepath to a second raster

covariate = LocalRaster(rasters = "myRasterStack")   # A RasterStack object already loaded

Online Repositories

Sometimes we may need or want to source our occurrence and/or covariate data from online sources. The modules of interest here are SpOcc, Bioclim, Bioclim_future, and NCEP.

The SpOcc module is used to obtain species occurrence records from a selection of online databases. The available databases are GBIF, BISON, iNat, eBird, Ecoengine, and AntWeb. We can call this module like this:

occurrence = SpOcc(species = "SpeciesName",   # Species scientific name
                   extent = c(-1, 0, 51, 52), # Coordinates for the extent of the region
                   databases = "gbif",        # List of databases to use
                   type = "presence",         # Type of data you want
                   limit = 10000)             # A maximum limit of records to obtain

The Bioclim module obtains bioclimatic variables from WorldClim. These data are available at resolutions of 2.5, 5, or 10 arc-minutes, and there are 19 variables to choose from. We can call this module like this:

covariate = Bioclim(extent = c(-180, 180, -90, 90), # Coordinates for the extent of the region
                    resolution = 10,                # Required resolution
                    layers = 1:5)                   # Variables we want (any of 1 to 19)
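Judging by the examples in this tutorial, extent vectors are written in the order c(xmin, xmax, ymin, ymax), with longitudes first and latitudes second. The helper below is hypothetical and not part of zoon; it is a small sketch for building such a vector and catching swapped or out-of-range values before they reach a module.

```r
# Hypothetical helper (not part of zoon): build and validate an extent
# vector in the order c(xmin, xmax, ymin, ymax) for longitude/latitude data.
make_extent <- function(xmin, xmax, ymin, ymax) {
  stopifnot(
    xmin < xmax, ymin < ymax,        # bounds must not be swapped
    xmin >= -180, xmax <= 180,       # valid longitude range
    ymin >= -90,  ymax <= 90         # valid latitude range
  )
  c(xmin, xmax, ymin, ymax)
}

uk_extent    <- make_extent(-1, 0, 51, 52)
world_extent <- make_extent(-180, 180, -90, 90)
```

Storing the extent in one variable and passing it to both the occurrence and covariate modules keeps the two datasets on the same region.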

We can also obtain these bioclimatic variables for predictions of the future using the Bioclim_future module. Anyone looking to obtain this data should first research Representative Concentration Pathways and General Circulation Models to make an informed decision about the predictions they are after. We can call this module like this:

covariate = Bioclim_future(extent = c(-10, 10, 45, 65),  # Coordinates of the extent of the region
                           resolution = 10,              # Resolution of the data
                           layers = 1:19,                # Required Bioclim variables
                           rcp = 45,                     # Representative Concentration Pathways
                           model = "AC",                 # General Circulation Models
                           year = 70)                    # Time period for the prediction

The NCEP module obtains environmental data from the National Centers for Environmental Prediction. We can call this module like this:

covariate = NCEP(extent = c(-5, 5, 50, 60),     # Coordinates of the extent of the region
                 variables = "hgt",             # Character vector of variables of interest
                 status.bar = FALSE)            # Show a status bar of download progress?

Example

Now that we’ve seen the different ways of obtaining data in zoon, let’s see an example. Here we obtain presence-only data for the grizzly bear (Ursus arctos) in North America from GBIF, and bioclimatic variables from WorldClim.

Ursus_arctos_online <- workflow(occurrence = SpOcc(species = "Ursus arctos",
                                                   extent = c(-175, -65, 20, 75),
                                                   databases = "gbif",
                                                   type = "presence"),
                                covariate = Bioclim(extent = c(-175, -65, 20, 75),
                                                    resolution = 10,
                                                    layers = 1:19),
                                process = Chain(StandardiseCov, Background(1000)),
                                model = MaxEnt,
                                output = InteractiveOccurrenceMap)